library(knitr)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(plotly)
library(rstatix)
library(reshape2)
library(car)
library(kableExtra)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
  # this hook is used only when the linewidth option is not NULL
  if (!is.null(n <- options$linewidth)) {
    x = knitr:::split_lines(x)
    # any lines wider than n should be wrapped
    if (any(nchar(x) > 120)) x = strwrap(x, width = 120)
    x = paste(x, collapse = '\n')
  }
  hook_output(x, options)
})

Guided part

1: Read in the gapminder_clean.csv data as a tibble using read_csv.

gapminder <- read_csv("gapminder_clean.csv")

2: Filter the data to include only rows where Year is 1962 and then make a scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap for the filtered data.

gapminder %>% 
  filter(Year == 1962) %>%
  ggplot(aes(x = gdpPercap ,y = `CO2 emissions (metric tons per capita)`)) + geom_point() + xlab("GDP per capita (log10)") +
  ylab(bquote("CO"[2]*" emissions in mtpc (log10)")) + 
  scale_x_log10() + scale_y_log10() + ggtitle(bquote("CO"[2]*" emissions per capita vs GDP per capita")) + theme(plot.title = element_text(hjust = 0.5))

We can see that, as expected, there was a considerable positive correlation between CO2 emissions and GDP per capita in 1962.

3: On the filtered data, calculate the pearson correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the Pearson R value and associated p value?

The Pearson correlation method needs data to be homoscedastic. Let’s visually inspect whether this condition is met (a regression line and residual values were used to determine the zoom and rotation applied to the raw graph).

gapminder %>%
  summarise(Fitted = fitted(lm(`CO2 emissions (metric tons per capita)`~gdpPercap)),Residuals = resid(lm(`CO2 emissions (metric tons per capita)`~gdpPercap))) %>%
  ggplot(aes(x=Fitted,y=Residuals)) + geom_point() + ggtitle("Residuals vs Fitted") + theme(plot.title = element_text(hjust = 0.5))

Variance is clearly greater as values grow (the leftmost part, which comprises most observations, is cone-shaped), so the data is clearly not homoscedastic. We will thus use Spearman’s correlation coefficient, which doesn’t assume homoscedasticity.

value <- gapminder %>%
  summarise(cor = cor(`CO2 emissions (metric tons per capita)`,gdpPercap,method = "spearman", use = "complete.obs"))

The spearman correlation coefficient between CO2 emissions and GDP per capita is 0.9128636.

4: On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step…

value <- gapminder %>%
  group_by(Year) %>%
  summarize(Correlation = cor(gdpPercap,`CO2 emissions (metric tons per capita)`,use = "complete.obs",method = "spearman")) %>%
  slice_max(abs(Correlation))

The year with the strongest correlation between GDP per capita and CO2 emissions per capita is 2002, with a Spearman correlation coefficient of 0.94. We can observe the trend in the following figure:

gapminder %>%
  group_by(Year) %>%
  summarize(Correlation = cor(gdpPercap,`CO2 emissions (metric tons per capita)`,use = "complete.obs",method = "spearman")) %>%
  ggplot(aes(x = factor(Year),y = Correlation)) + geom_bar(stat = "identity",fill = "#40e2ff") + 
  ggtitle(bquote("Correlation between CO"[2]*" emissions and GDP 1962-2007 (per capita)")) + xlab("Year")

5: Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.

p <- gapminder %>% 
  filter(Year == 1962) %>%
  ggplot(aes(x = gdpPercap ,y = `CO2 emissions (metric tons per capita)`)) + geom_point(aes(size = pop,colour = continent)) + ggtitle("CO<sub>2</sub> emissions per capita vs GDP per capita") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_x_log10() +    scale_y_log10() + 
  xlab("GDP per capita (log10)") + ylab("CO<sub>2</sub> emissions in mtpc (log10)")
ggplotly(p)

We can see that African, American and Asian countries are at the bottom in both emissions and GDP per capita. European countries, with some exception, are more rich and carbonic anhydride emitting.The two American outliers must be Canada and the United States of America.

data_frame <- as.data.frame(gapminder)

Unguided part

1: What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’? (stats test needed)

Before applying a statistical test we will visualize the distribution of energy use in each continent through the use of boxplots. We have used only the values for 2007 for the amount of outliers to be readily interpretable. You can hover your mose over a point in the interactive graph to see which country it is representing.

p <- gapminder %>%
  filter(!is.na(continent), Year == 2007) %>%
  ggplot(aes(x = continent,y = `Energy use (kg of oil equivalent per capita)`)) + geom_boxplot() + geom_point(aes(color = `Country Name`))+
  theme(legend.position = "none",plot.title = element_text(hjust= 0.5)) + ggtitle("Energy use per continent") + xlab("Continent") + ylab("Energy use (kgoe)")
  
ggplotly(p)

Europe and Oceania are the most energy consuming continents, followed by Asia and the Americas. Asian outliers are rich countries, due to oil or otherwise. The three clear American outliers are the two rich North American countries and the industry intensive Trinidad & Tobago. Africa presents the lower values, with South Africa and resource rich Lybia and Equatorial Guinea as outliers.

If the data are normally distributed and homoscedastic, an ANOVA test would be appropriate. Let’s check those assumptions.

gapminder %>%
  filter(!is.na(continent),!is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
  group_by(continent) %>%
  summarize(Shapiro =shapiro.test(`Energy use (kg of oil equivalent per capita)`)[2]$p.value)

Only Oceania seems to be normally distributed. The null hypothesis (normality) is rejected. We shall thus use the Kruskal-Wallis test.

p_value <- gapminder %>%
  filter(!is.na(continent), !is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
  summarise(kruskal = kruskal.test(`Energy use (kg of oil equivalent per capita)`~continent)$p.value)

The null hypothesis (equality of means) is rejected.

This test only refutes the hypothesis that all means are equal but doesn’t tell us if the difference between two specific continents is significative. In order to ascertain that we used a Dunn test, which showed that all differences were significant except for the differences between Asia and the Americas and Europe and Oceania. The results are shown in the following table.

gapminder %>%
  filter(!is.na(continent), !is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
  dunn_test(`Energy use (kg of oil equivalent per capita)`~ continent) %>%
  mutate(Significative = ifelse(p.adj.signif == "ns","No","Yes")) %>%
  rename(`Continent 1` = group1 ,`Continent 2` = group2) %>%
  select(`Continent 1`,`Continent 2`,Significative) %>%
  kable(align = "c") %>%
 kable_styling()
Continent 1 Continent 2 Significative
Africa Americas Yes
Africa Asia Yes
Africa Europe Yes
Africa Oceania Yes
Americas Asia No
Americas Europe Yes
Americas Oceania Yes
Asia Europe Yes
Asia Oceania Yes
Europe Oceania No

2: Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990? (stats test needed)

As in the previous section, we will visualize before applying a statistical test:

gapminder %>%
  filter(!is.na(`Imports of goods and services (% of GDP)`), Year > 1990) %>%
  filter(continent == "Europe" | continent == "Asia") %>%
  group_by(Year) %>%
  rename(Continent = continent) %>%
  ggplot(aes(x = as.factor(Year),y = `Imports of goods and services (% of GDP)`,fill = Continent)) + 
  geom_violin() + labs(x = "Year",y = "Imports of G&S (% of GDP)") +
  ggtitle("Evolution of imports of G&S (% of GDP)") + theme(plot.title = element_text(hjust= 0.5))

The graph would seem to show that mean European country imports as % of GDP have been growing faster than mean Asian country imports as % of GDP, almost converging (with Asian % of import being more variable) in 2007.

We will plot an histogram in order to see whether there is obvious non-normality.

gapminder %>%
  filter(!is.na(`Imports of goods and services (% of GDP)`), Year > 1990) %>%
  filter(continent == "Europe") %>%
  ggplot(aes(x = `Imports of goods and services (% of GDP)`)) + geom_histogram() +
  ggtitle("Distribution of imports of G&S (% of GDP) in Europe") +
  theme(plot.title = element_text(hjust = 0.5))

Data for Europe is clearly not normally distributed, so we will use the Wilcoxon non-parametric test.

value <- gapminder %>%
  filter(!is.na(`Imports of goods and services (% of GDP)`), Year > 1990) %>%
  filter(continent == "Europe" | continent == "Asia") %>%
  summarise(wilcox.test(formula = `Imports of goods and services (% of GDP)` ~ continent)$p.value)

The null hypothesis (equality of mean imports as a % of GDP between Europe & Asia) can’t be rejected.

3: What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

tabla <- gapminder %>%
  drop_na(`Population density (people per sq. km of land area)`) %>%
  group_by(Year) %>%
  mutate(ranking = min_rank(desc(`Population density (people per sq. km of land area)`))) %>%
  ungroup() %>%
  group_by(`Country Name`) %>%
  summarize(mean_ranking = mean(ranking)) %>%
  arrange(mean_ranking) %>%
  slice_min(mean_ranking) %>%
  select(`Country Name`)

Macao SAR, China was the country with a highest average population density ranking.

4: What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ since 1962?

tabla <- gapminder %>%
  filter(Year == 1962 | Year == 2007) %>%
  group_by(`Country Name`) %>%
  filter(!any(is.na(`Life expectancy at birth, total (years)`)), n() > 1) %>%
  arrange(Year) %>%
  mutate(Difference = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962]) %>%
  select(Difference) %>%
  ungroup() %>%
  arrange(Difference) %>%
  slice_tail()

The country featuring the greatest increase of life expectancy at birth in the 1962-2007 period was Maldives, with an increase of 36.92 years.

gapminder %>%
  ggplot(aes(y = `CO2 emissions (metric tons per capita)`,x =gdpPercap)) + geom_point()